Client Report - Finding Relationships in Baseball

Course DS 250

Author

[Maia Faith Chambers]

Show the code
import pandas as pd 
import numpy as np
import sqlite3
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
Show the code
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

QUESTION|TASK 1

Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

The query below returns the results of two players who attended BYU-Idaho: Mark L. Lindsma and Garrett Stephenson, identified by their player IDs as lindsma01 and stephga01, respectively. Both players had long and varied careers in Major League Baseball. Lindsma played from 2007 to 2014, while Stephenson’s career spanned from 1997 to 2003. Lindsma’s salary peaked later in his career, earning $4,000,000 in 2014 with the Chicago White Sox. In contrast, Stephenson reached his highest salary of $1,025,000 in 2001, during the mid-point of his career with the St. Louis Cardinals. This analysis highlights the financial success of these two BYU-Idaho alumni in the professional baseball world.

Show the code
# Query salary info for BYU-Idaho players using known playerIDs
query_byu_players = """
SELECT 
    playerID, 
    salary,
    yearID,
    teamID
FROM 
    Salaries
WHERE 
    playerID IN ('lindsma01', 'stephga01')
ORDER BY 
    salary DESC;
"""
df_baseball1 = pd.read_sql_query(query_byu_players, con)
df_baseball1["schoolID"] = "idbyuid"
df_baseball1
playerID salary yearID teamID schoolID
0 lindsma01 4000000.0 2014 CHA idbyuid
1 lindsma01 3600000.0 2012 BAL idbyuid
2 lindsma01 2800000.0 2011 COL idbyuid
3 lindsma01 2300000.0 2013 CHA idbyuid
4 lindsma01 1625000.0 2010 HOU idbyuid
5 stephga01 1025000.0 2001 SLN idbyuid
6 stephga01 900000.0 2002 SLN idbyuid
7 stephga01 800000.0 2003 SLN idbyuid
8 stephga01 550000.0 2000 SLN idbyuid
9 lindsma01 410000.0 2009 FLO idbyuid
10 lindsma01 395000.0 2008 FLO idbyuid
11 lindsma01 380000.0 2007 FLO idbyuid
12 stephga01 215000.0 1999 SLN idbyuid
13 stephga01 185000.0 1998 PHI idbyuid
14 stephga01 150000.0 1997 PHI idbyuid
Show the code
#This part is extra, I wanted to see how it would look in a lets plot showing the comparrison of salary over the two players careers.
# Line plot to visualize their salary over time
query_compare_salaries = """
SELECT 
    playerID, 
    yearID, 
    salary
FROM 
    Salaries
WHERE 
    playerID IN ('lindsma01', 'stephga01')
ORDER BY 
    yearID;
"""

df_salaries = pd.read_sql_query(query_compare_salaries, con)

player_colors = {
    'lindsma01': '#27251F',
    'stephga01': '#C41E3A'
}

ggplot(df_salaries, aes(x='yearID', y='salary', color=as_discrete('playerID'))) + \
    geom_line(size=2) + \
    scale_color_manual(values=player_colors) + \
    scale_x_continuous(breaks=list(range(df_salaries['yearID'].min(), df_salaries['yearID'].max()+1))) + \
    ggtitle("Salary Comparison: Mark Lindsma vs Garrett Stephenson") + \
    xlab("Year") + \
    ylab("Salary (USD)") + \
    theme_minimal()

QUESTION|TASK 2

This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
a. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
a. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

This three-part question looked at batting averages, which are calculated by dividing hits by at-bats. For the first part, I found the top batting averages for players who had at least one at-bat in a season. Not surprisingly, all five players had a perfect 1.000 average, meaning they got a hit every time they were officially at bat that year. But most of them probably only had one or two chances, so the data isn’t super meaningful yet.

Show the code
# Include and execute your code here
query_2a = """
SELECT 
    playerID, 
    yearID, 
    ROUND(CAST(H AS FLOAT)/AB, 3) AS batting_avg
FROM 
    Batting
WHERE 
    AB > 0
ORDER BY 
    batting_avg DESC, playerID ASC
LIMIT 5;
"""

df_2a = pd.read_sql_query(query_2a, con)
df_2a
playerID yearID batting_avg
0 aberal01 1957 1.0
1 abernte02 1960 1.0
2 abramge01 1923 1.0
3 acklefr01 1964 1.0
4 alanirj01 2019 1.0

In the second part, I made the filter a bit stricter by only including players with at least 10 at-bats. This helped bring out more reliable results. The top players now had averages between .571 and .643, with names like Manny Ny (1974) and Carson M. (2013) leading the list. These guys still had great seasons, but the extra filter helps avoid flukes.

Show the code
query_2b = """
SELECT 
    playerID, 
    yearID, 
    ROUND(CAST(H AS FLOAT)/AB, 3) AS batting_avg
FROM 
    Batting
WHERE 
    AB >= 10
ORDER BY 
    batting_avg DESC, playerID ASC
LIMIT 5;
"""

df_2b = pd.read_sql_query(query_2b, con)
df_2b
playerID yearID batting_avg
0 nymanny01 1974 0.643
1 carsoma01 2013 0.636
2 altizda01 1910 0.600
3 johnsde01 1975 0.600
4 silvech01 1948 0.571

For the third part, I calculated batting averages over entire careers by summing up all hits and at-bats for each player and then dividing. I also made sure to only include players with at least 100 at-bats total. The results brought out legends like Ty Cobb (.366), Rogers Hornsby (.358), and Joe Jackson (.356). These are the kind of averages that reflect long-term consistency, not just one standout season.

Show the code
query_2c = """
SELECT 
    playerID, 
    ROUND(SUM(CAST(H AS FLOAT))/SUM(AB), 3) AS career_batting_avg
FROM 
    Batting
GROUP BY 
    playerID
HAVING 
    SUM(AB) >= 100
ORDER BY 
    career_batting_avg DESC, playerID ASC
LIMIT 5;
"""

df_2c = pd.read_sql_query(query_2c, con)
df_2c
playerID career_batting_avg
0 cobbty01 0.366
1 barnero01 0.360
2 hornsro01 0.358
3 jacksjo01 0.356
4 meyerle01 0.356

QUESTION|TASK 3

Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Lets-Plot to visualize the comparison. What do you learn?

I used the Seattle Mariners and the Texas rangers using average salary by year as my metric. The SQL query groups the player salaries by team and year and then calculates the average salary for each. The results are from the range years of 1985 to 2016, during this time the avaerage salary was over $6 million. The Marineres showed growth but were slightly behind in the end. This comparison gave me a better sense of how each organization has financially valued its players over time. It also raises questions about how payroll size might relate to team performance, although I’m a mariners fan, I realize we lost many games so it would be interesting to see a performance to salary analysis.

Show the code
query_task3_mariners_rangers = """
SELECT 
    teamID, 
    yearID, 
    ROUND(AVG(salary), 2) AS avg_salary
FROM 
    Salaries
WHERE 
    teamID IN ('SEA', 'TEX')
GROUP BY 
    teamID, yearID
ORDER BY 
    yearID;
"""

df_task3_mr = pd.read_sql_query(query_task3_mariners_rangers, con)
df_task3_mr
teamID yearID avg_salary
0 SEA 1985 256277.78
1 TEX 1985 383825.00
2 SEA 1986 229165.73
3 TEX 1986 259350.73
4 SEA 1987 251500.00
... ... ... ...
59 TEX 2014 4677294.13
60 SEA 2015 4888348.00
61 TEX 2015 4791426.30
62 SEA 2016 4845833.54
63 TEX 2016 6070300.79

64 rows × 3 columns

The chart shows that both teams started with lower salaries in the 1980s, but average pay steadily increased over time. While both teams experienced ups and downs, the Texas Rangers had more dramatic spikes and sharper increases compared to the Mariners. The Mariners’ salary growth was more consistent and gradual, whereas the Rangers ended up on top with more noticeable jumps throughout the years.

Show the code
#the plot
from lets_plot.mapping import as_discrete

team_colors = {
    'SEA': '#0C2C56',  # Mariners - Navy
    'TEX': '#C0111F'   # Rangers - Red
}

ggplot(df_task3_mr, aes(x='yearID', y='avg_salary', color=as_discrete('teamID'))) + \
    geom_line(size=1.5) + \
    scale_color_manual(values=team_colors) + \
    scale_x_continuous(breaks=list(range(df_task3_mr['yearID'].min(), df_task3_mr['yearID'].max()+1))) + \
    ggtitle("Average Salary: Seattle Mariners vs Texas Rangers") + \
    xlab("Year") + \
    ylab("Average Salary (USD)") + \
    theme_minimal()

STRETCH QUESTION|TASK 1

Advanced Salary Distribution by Position (with Case Statement):

* Write an SQL query that provides a summary table showing the average salary for each position (e.g., pitcher, catcher, outfielder). Position information can be found in the fielding table in the POS column. 

    Include the following columns:

    * position
    * average_salary
    * total_players
    * highest_salary  

* The highest_salary column should display the highest salary ever earned by a player in that position. 

* Additionally, create a new column called salary_category using a case statement:  

    * If the average salary is above $3 million, categorize it as “High Salary.”
    * If the average salary is between $2 million and $3 million, categorize it as “Medium Salary.”
    * Otherwise, categorize it as “Low Salary.”  

* Order the table by average salary in descending order.

**Hint:** Beware, it is common for a player to play multiple positions in a single year. For this analysis, each player’s salary should only be counted toward one position in a given year: the position at which they played the most games that year. This will likely require a (sub-query)[https://docs.data.world/documentation/sql/concepts/advanced/WITH.html].

type your results and analysis here

Show the code
# Include and execute your code here
salary_by_position_query = """
WITH PrimaryPosition AS (
    SELECT 
        playerID,
        yearID,
        POS,
        MAX(G) AS games_played
    FROM (
        SELECT 
            playerID, 
            yearID, 
            POS, 
            SUM(G) AS G
        FROM 
            Fielding
        GROUP BY 
            playerID, yearID, POS
    )
    GROUP BY 
        playerID, yearID
    HAVING 
        G = MAX(G)
),

PositionSalary AS (
    SELECT 
        pp.POS AS position,
        s.playerID,
        s.salary
    FROM 
        Salaries s
    JOIN 
        PrimaryPosition pp ON s.playerID = pp.playerID AND s.yearID = pp.yearID
)

SELECT 
    position,
    ROUND(AVG(salary), 2) AS average_salary,
    COUNT(DISTINCT playerID) AS total_players,
    MAX(salary) AS highest_salary,
    CASE 
        WHEN AVG(salary) > 3000000 THEN 'High Salary'
        WHEN AVG(salary) BETWEEN 2000000 AND 3000000 THEN 'Medium Salary'
        ELSE 'Low Salary'
    END AS salary_category
FROM 
    PositionSalary
GROUP BY 
    position
ORDER BY 
    average_salary DESC;
"""

df_salary_by_position = pd.read_sql_query(salary_by_position_query, con)
df_salary_by_position
position average_salary total_players highest_salary salary_category
0 1B 3336210.69 469 28000000.0 High Salary
1 OF 2405474.54 1119 27328046.0 Medium Salary
2 3B 2321342.06 491 33000000.0 Medium Salary
3 SS 2010018.49 368 22600000.0 Medium Salary
4 P 1938130.80 2556 33000000.0 Low Salary
5 2B 1794492.33 485 24000000.0 Low Salary
6 C 1429781.30 404 23000000.0 Low Salary

STRETCH QUESTION|TASK 2

Advanced Career Longevity and Performance (with Subqueries):

* Calculate the average career length (in years) for players who have played at least **10 games**. Then, identify the top 10 players with the longest careers (based on the number of years they played). Include their: 

    * playerID
    * first_name
    * last_name
    * career_length

* The career_length should be calculated as the difference between the maximum and minimum yearID for each player.  

type your results and analysis here

Show the code
career_query = """
WITH CareerSpan AS (
    SELECT 
        a.playerID,
        MIN(a.yearID) AS start_year,
        MAX(a.yearID) AS end_year,
        (MAX(a.yearID) - MIN(a.yearID) + 1) AS career_length,
        SUM(a.G_all) AS total_games
    FROM 
        Appearances a
    GROUP BY 
        a.playerID
    HAVING 
        total_games >= 10
),

TopCareers AS (
    SELECT 
        cs.playerID,
        p.nameFirst AS first_name,
        p.nameLast AS last_name,
        cs.career_length
    FROM 
        CareerSpan cs
    JOIN 
        People p ON cs.playerID = p.playerID
    ORDER BY 
        cs.career_length DESC
    LIMIT 10
)

SELECT 
    * 
FROM 
    TopCareers;
"""

df_top_careers = pd.read_sql_query(career_query, con)
df_top_careers
playerID first_name last_name career_length
0 altroni01 Nick Altrock 36
1 orourji01 Jim O'Rourke 33
2 minosmi01 Minnie Minoso 32
3 olearch01 Charley O'Leary 31
4 lathaar01 Arlie Latham 30
5 mcguide01 Deacon McGuire 29
6 eversjo01 Johnny Evers 28
7 jennihu01 Hughie Jennings 28
8 ryanno01 Nolan Ryan 28
9 streega01 Gabby Street 28
Show the code
avg_career_query = """
WITH CareerSpan AS (
    SELECT 
        playerID,
        MIN(yearID) AS start_year,
        MAX(yearID) AS end_year,
        (MAX(yearID) - MIN(yearID) + 1) AS career_length,
        SUM(G_all) AS total_games
    FROM 
        Appearances
    GROUP BY 
        playerID
    HAVING 
        total_games >= 10
)

SELECT 
    ROUND(AVG(career_length), 2) AS avg_career_length
FROM 
    CareerSpan;
"""

df_avg_career = pd.read_sql_query(avg_career_query, con)
df_avg_career
avg_career_length
0 6.84